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DNA copy number aberrations (CNAs) are a hallmark of cancer genomes. However, little is known 
about how such changes affect global gene expression. We develop a modeling framework, EPoC 
(Endogenous Perturbation analysis of Cancer), to (1) detect disease-driving CNAs and their effect 
on target mRNA expression, and to (2) stratify cancer patients into long- and short-term survivors. 
Our method constructs causal network models of gene expression by combining genome-wide 
DNA- and RNA-level data. Prognostic scores are obtained from a singular value decomposition of the 
networks. By applying EPoC to glioblastoma data from The Cancer Genome Atlas consortium, we 
demonstrate that the resulting network models contain known disease-relevant hub genes, reveal 
interesting candidate hubs, and uncover predictors of patient survival. Targeted validations in four 
glioblastoma cell lines support selected predictions, and implicate the p53-interacting protein 
Necdin in suppressing glioblastoma cell growth. We conclude that large-scale network modeling 
of the effects of CNAs on gene expression may provide insights into the biology of human cancer. 
Free software in MATLAB and R is provided. 
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Introduction 

Gains and losses of chromosomal material that alter DNA copy 
number are a hallmark of cancer genomes. At the level of 
a single locus, the effects of a copy number aberration (CNA) 
are well known: on average, increased copy number (gene 
amphfication) leads to increased gene expression, decreased 
copy number (gene deletion) leads to decreased gene 
expression (Pollack et aU 2002; Lee et aU 2008; Nilsson et aU 
2008). However, CNAs also affect the expression of genes 
located outside the amphfied/deleted region itself via indirect 
mechanisms. For example, deletion of a transcriptional 
repressor may increase the expression of its targets, amphfica- 
tion of a kinase may drive a signaling cascade, and so on. Our 
knowledge of how CNAs affect gene expression at a genome- 
wide level is limited. 

Global network modeling of expression and copy number 
changes can elucidate such causal connections, and prove 
helpful in the study of several key problems in cancer biology. 
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Specifically, such models may (1) identify functionally 
important genes whose perturbations have a significant and 
dispersed impact on transcription; (2) facilitate the discovery 
of possible therapeutic targets by matching model-identified 
key regulators or their targets to pharmacological databases; 
and (3) assist in the identification of distinct CNA and mRNA 
features that are predictive of patient prognosis or therapeutic 
response. 

Current methods for transcriptional network modeling 
seemingly fall into three main categories. One common 
approach is to derive models from mRNA profiles only, using, 
e.g., gene-gene (partial) correlations, Bayesian networks, 
ordinary differential equations or mutual information 
(Friedman et aZ, 2000; Schafer and Strimmer, 2005; Margolin 
et aU 2006; Bansal et aU 2007; Opgen-Rhein and Strimmer, 
2007). A second common technique is to construct models of 
mRNA expression from targeted perturbation experiments, as 
controlled perturbations strongly facilitate causal inference 
(Yeung et al 2002; Tegner et aU 2003; di Bernardo et aU 2005; 
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Bansal et al, 2007; Bonneau et al, 2007; Lehar et al, 2007; 
Nelander et al, 2008; Lauria et al, 2009). A third alternative is 
to use the naturally occurring genetic variation in a separating 
population to study the relationship between genotype and 
expression phenotype (Jansen, 2003; Lee et al, 2006, 2009; 
Rockman, 2008; Suthram et al, 2008; Zhu et al, 2008; see also 
Discussion). Here, we focus on the role of acquired genetic 
variations in tumors, specifically CNAs, and ask how these 
can be used to derive transcriptional networks. CNAs are 
prevalent in several human cancers, and tend to appear in a 
patient-specific and multifactorial manner in the tumors, 
which resembles an optimal experimental design to derive 
causality (Fisher, 1926). 

We present a global model of CNA-driven transcription 
in the brain tumor ghoblastoma. The model is derived using 
EPoC (Endogenous Perturbation analysis of Cancer), a 
computational method that constructs network models of 
mRNA expression, viewing CNAs as informative system 
perturbations introduced endogenously during the evolution 
of the tumor, and the corresponding mRNA profiles as the 
steady-state response to that perturbation. We apply EPoC to 
ghoblastoma data from The Cancer Genome Atlas (TCGA) 
consortium. Previous analyses of ghoblastoma have revealed 
altered pathways and disease subtypes (Pollack et al, 2002; 
Freije et al, 2004; Phillips et al, 2006; Tso et al, 2006; Lee et al, 
2008; TCGA-Consortium, 2008; Dahlback et al, 2009; Verhaak 
et al, 2010; Cerami et al, 2010) and networks of correlating 
transcripts (Carro et al, 2010) (ARACNE). Key examples 
of CNA/mRNA analyses for other tumors include clustering 
and modular network modeling, leading to the discovery 
of regulators such as MITF, RAB27A and TBC1D16 in 
malignant melanoma (Garraway et al, 2005; Akavia et al, 
2010), and linkage analysis to reveal the association of cMYC 
amphfication to wound healing signatures in breast cancer 
(Adler et al, 2006). Network analysis of 654 selected breast 
cancer transcripts and 384 genomic regions has identified a 
candidate regulatory region on chromosome 17 (Peng et al, 
2008). Canonical correlation analysis (CCA) has also been 
put forth as an alternative non-network approach to inte- 
grating DNA/mRNA data (Waaijenborg et al, 2008; Witten 
et al, 2009). 

We use EPoC to construct a gene-level model, which 
encompasses 10 672 genes, causally connecting CNAs to 
expression changes in ghoblastoma. First, we estabhsh that 
the parameters of the EPoC network model can be robustly 
estimated from paired genome-wide DNA- and RNA-level 
data from a set of tumors, using a combination of lasso 
regression and bootstrap. Second, we show that a novel 
score, based on a sparse singular value decomposition of 
the derived CNA-mRNA network model, identifies prognostic 
biomarkers capable of clinical stratification into short-term 
and long-term survivors. Third, EPoC identifies key mecha- 
nisms (disease-driving CNAs), which we assess by chemo- 
informatic analyses and comparisons to known biological 
pathways, revealing the likely existence of short regulatory 
paths between EPoC hubs and targets, as well as 15 candidate 
drug targets. We confirm a candidate hub, the p5 3 -interacting 
protein Necdin, NDN, in U87MG, U373MG, U343MG and 
T98 ghoma-derived cell lines by experimentally testing a 
small transcriptional network around NDN, receptor tyrosine 
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kinases EGF receptor [EGFR] and platelet-derived growth 
factor receptor alpha [PDGFRA). Finally, we demonstrate 
rapid and consistent performance of EPoC in comparison 
with mRNA- only methods, standard expression quantitative 
locus (eQTL) methods and two recent multivariate methods 
for genotype-mRNA coupling (Peng et al, 2008; Lee et al, 
2009). 

Results 

Modeling copy number-dependent transcription 
in tumors 

Transcriptional and CNA-driven networks 

To connect mRNA levels with DNA copy number in 
ghoblastoma tumors, we adapt a common model for mRNA 
transcription regulation and turnover. This model formulation, 
related to the so-cahed S-system (Savageau, 1969, 1976; 
Crampin et al, 2004), takes the form of sets of differential 
equations: 

synthesis decay 

^=u,a,f[^-'-P,f[^M = l,...,n, (1) 

7=1 i=i 

where n is the number of genes, dy^/dt and y^, i=l,2, ...,n 
denote the change rate and average mRNA concentrations in 
a tumor respectively, and Uj^O, i=l,2, ...,n the average 
number of gene copies corresponding to a particular transcript 
(Figure IB). Equation (1) states that the change rate of 
transcript y^ is the difference between its synthesis rate and its 
decay rate. The synthesis rate is determined by the number of 
copies of the gene's DNA, Ui, the regulatory effects of other 
genes, Wtj and a gene-specific synthesis constant, a^. Similarly, 
the decay rate is determined by the regulatory effects of other 
genes, and a gene-specific decay constant, p^. Obviously, the 
assumption of proportionality on Ut is a simphfication and 
unlikely to hold for ah genes in the genome (e.g., gene copies 
may generate transcripts at different rates due to epigenetic 
differences). Nevertheless, recent data indicate that it is a 
reasonable approximation for a large proportion of genes in 
the genome (Nilsson et al, 2008). 

The procedure used to estimate the model parameters in 
Equation (1) is described in detail in Materials and methods. 
In short, assuming steady-state conditions, the log-transformed 
and zero-centered mRNA and CNA profiles of ghoblastoma can 
be summarized by two mutually complementing linear 
systems. The first of these represents the transcriptional 
network (A): 

AA7 + A[/ + i^ = 0, (2) 

where AY and AU are stack matrices of log-transformed and 
zero-centered mRNA and CNA profiles of glioblastoma, 
respectively, and R (defined by the a's and (3's of the original 
model. Materials and methods) is a matrix that captures the 
effects on transcription of non-CNA perturbations in individual 
tumors (e.g., SNPs, sequence mutations or environmental 
effects). The transcriptional network A={ay} relates to the 
original model by aij=Wij—Vij, meaning its elements 
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Figure 1 Overview of the EPoC modeling frameworl<. (A) Using genome-wide, paired mRNA- and DNA-level data as input, EPoC generates a quantitative 
causal network model of the global effects of copy number aberrations on mRNA expression. The resulting model is subsequently used to predict disease-driving 
genes and prognostic indicators. (B) EPoC is based on systems of differential equations that take into account that the transcription of a gene is determined both 
by its own DNA copy number (straight arrows) and the product of other genes (bent arrows). (C) Our method generates two mutually complementary networks 
denoted as A and G. The A network captures transcript-transcript interactions (left), whereas the G network contains the direct and indirect effects of CNA pertur- 
bations on transcription (middle). The singular value decomposition of G can be used to identify the CNAs whose perturbations are maximally amplified by the 
network (i.e., they have a strong overall transcriptional effect; yellow nodes), and the mRNA transcripts whose expression are most altered by these perturbations 
(green nodes; right panel). 
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represents the net influence from transcript j to transcript i; 
aij>0 indicates activation of transcription i by transcript 
j, ay <0 inhibition, and the magnitude the strength of the 
interaction. 

The second representation is termed the CNA-driven 
network (G): 

AY = GAU + r. (3) 

G={gij} consists of CNA-mRNA couplings: gy>0 indicates 
CNA-driven transcriptional activation (i.e., transcription of 
gene i is increased because the copy number of gene; has been 
altered), gy<0 CNA-driven transcriptional inhibition, and the 
magnitude of gy the strength of the interaction. This network is 
related to the first as G=-A~^ and the topologies of the two 
networks are thus related (Figure IC). However, while A 
reflects direct transcriptional interaction, corrected for the 
impact of a transcript's own CNA, G models how the effects of 
CNA perturbations propagate through the system to produce 
their steady-state responses and should contain key disease- 
driving CNAs as hubs, as well as their downstream targets 
(Figure IC). 

To identify the transcriptional interactions (non-zero ele- 
ments in A) and the CNA-mRNA couplings (non-zero 
elements in G), we need to solve the large linear equation 
systems ((2) and (3)}. We use a gene-level lasso regression 
approach paired with cross-vahdation and bootstrap to 
robustly identify these network parameters (Materials and 
methods). 



Survival scores derived from network decompositions 

We next describe how survival scores can be derived from 
the EPoC model, based on a particular interpretation of the 
CNA-driven network as a signal amplifier. From a systems 
perspective, it is natural to view the copy number profile as 
the input to the system G, whereas the gene expression 
profile is the corresponding output. One common way to 
summarize a system's input-output behavior is to compute 
the main axes of signal gain, defined as the singular value 
decomposition G=CAD^ (Golub and Loan, 1996; Skogestad 
and Postlethwaite, 2005; Nordling and Jacobsen, 2009} 
(Materials and methods). When apphed to the CNA-driven 
network G, this decomposition should reveal CNA pertur- 
bations that are strongly amplified by the system (in the 
leading columns of D), as well as the transcripts which 
are most affected by CNA perturbations (in the leading 
columns of C) (Figure IC). We use sparse SVD (Zou 
et aU 2006), which ensures that only a small subset of 
perturbations and transcripts are present in the leading 
columns (Materials and methods). Once estimates of C and 
D have been obtained, EPoC computes the level of signal 
amphfication in each tumor by the scalar projection scores 
Zy^C^AYand Zu=D^AU (Materials and methods). Concisely 
put, these scores summarize the total burden of molecular 
changes consistent with the CNA-driven network, and should 
therefore correlate with clinical survival. Below, we confirm 
this conjecture for the patients in the TCGA glioblastoma 
cohort. 
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Global CNA-driven networks of glioblastoma 

EPoC finds 512 robust associations between CNAs 
and mRNAs in glioblastoma 

We proceed to estimate EPoC networks for human ghoblasto- 
ma. We use CNA- and mRNA-level data (10 672 matched genes, 
186 patients) provided by the TCGA consortium (TCGA- 
Consortium, 2008). Before estimating the network, EPoC 
apphes a filter to select possible CNA regulators in the data 
(defined as genes that are recurrently amphfied or deleted 
across the patients; Materials and methods). In total, we keep 
2640 genes as possible CNA regulators, whereas we keep all 
10 672 genes as possible targets and/or transcriptional 
regulators. On the basis of these data, network modeling 
proceeds as follows. First, EPoC determines a suitable model 
size, i.e., the number of interactions in the network. For this, 
we have developed a customized procedure utilizing random 
data splits (Figure 2 A) . In brief, the tumors (i.e., the 186 ghoma 
cases) are split into two random groups. We estimate a 
network for each group, and compare the two networks using 
Kendall's W (akin to rank correlation of detected network 
interactions. Materials and methods). This procedure is 
repeated for different network sizes, and we select the network 
size that optimizes W. The optimal network size for the TCGA 
ghoblastoma data is estimated to 200-500 interactions 
(Figure 2A) . We then construct a robust final network of the 
optimal network size using bootstrap: On each of 1000 
bootstrap data sets (resampling from the 186 tumors; Fried- 
man et aU 2000), we generate a network of size around 400 (as 
obtained in Figure 2A) . We retain interactions that appear in at 
least 20% of the 1000 bootstrap networks, a cutoff set well 
above appearance frequencies expected by chance (Figure 2B) . 
This results in a final CNA-driven network with 512 interac- 
tions, of which 263 are stimulatory and 249 are inhibitory 
(Figure 3A). 



CNA hubs that best explain mRNA variability in 
glioblastoma 

The identified CNA-driven network G contains a number of 
copy number-altered genes that control multiple down- 
stream genes (Figure 3 A, Table I). Among these highly 
connected hub genes, we find well-known oncogenes and 
tumor supressors that are frequently deleted or amphfied 
in glioblastoma, including EGFR, PDGFRA, CDKN2A and 
CDKN2B (Figure 3A), confirming a clear association between 
these alterations and transcriptional variability of glioblastoma. 
In addition, EPoC identifies a number of interesting hub genes 
that have not previously been associated with ghoblastoma, 
e.g., MTAP and SEC61G. MTAP is located close to CDKN2A/B 
on chromosome 9, and SEC61G is located close to EGFR on 
chromosome 7, but both MTAP and SEC61G have unique and 
robustly identified targets in our network model. This may 
suggest that they are not mere innocent bystanders (passenger 
mutations), but may have tumorigenic effects of their own. 
Recent work has shown that amphfication of SEC61G leads 
to a more than 10-fold transcriptional induction of this 
gene (Bralten et al 2010); deletion of MTAP is beheved 
to confer sensitivity to purine synthesis inhibitors such as 
SDX-102 (Kindler et al, 2009). Additional hubs include 
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Figure 2 Derivation of robust and optimally sized network models for glioblastoma. (A) To select the network size, we use a customized validation technique in which 
networks generated in random data splits are compared using a rank correlation metric (one minus Kendall's 1/1/). Upper panel: Using this approach, we find glioblastoma 
networks with 200-500 interactions to be the structurally most consistent. The preferred network size is indicated by an asterisk (*) (details in Materials and methods). 
Lower panel: To assess the ability of a model to predict mRNA levels from CNAs, we estimate the normalized sum-of-squares prediction error by 10-fold cross-validation. 
This cross-validation identifies optimal networks of about 1 0 000 interactions. (B) We infer a robust CNA-driven network of size 512 from 1 86 paired gene expression and 
gene copy number profiles provided by The Cancer Genome Atlas (TCGA) consortium. For each of 1000 pseudo-bootstrap data sets, we generate a network of size 
around 400 (as obtained in Figure 2A). The final network retains interactions that appear in at least a fraction f of the bootstrap networks (frequency distribution shown as 
red curve). As a negative control, we permute the patients in the CNA data set (but not in the mRNA data) and repeat the estimation procedure, producing low 
frequencies for all individual interactions (dashed black curve and *). On the basis of these results, we here use f=20% (black line) as a frequency cutoff to generate our 
network model (Figure 3), which is well above frequencies expected by chance. 



interferon alpha 1 [IFNAl), myeloid/lymphoid or mixed- 
lineage leukemia translocated to 10 [MLLTIO, a well-known 
leukemia gene), glutamate decarboxylase 2 {GAD2), a 
postulated glutamate receptor GPR158 and Necdin [NDN, 
pursued below) . As expected, the model does not contain hubs 
to represent copy number neutral ghoma oncogenes/tumor 
suppressors altered by missense, nonsense or frameshift 
mutations [TPSS, ERBB2, NFl, RBI, PIK3R1, PIK3CA; Parsons 
et al 2008; TCGA-Consortium, 2008} . To account for the effects 
of additional types of mutations, we would require a model 
for the non-CNA perturbation term, R, in Equation (2), which 
is reserved for future work. 

Besides nominating disease drivers, the derived network 
itself contains additional useful information. For instance, 
we detect robust CNA-mRNA links between the hubs EGFR, 
PDGFRA, and CHIC2 and target genes that are markers 
of early neural development, such as the ghoblastoma 
stem cell marker GDI 33 (PROMl) and the transcription 
factors SOXIO, SOXll, NR2E1(TLX) and NKX2.2 (Figure 3B} 
(Shi et al 2008; Haslinger et al 2009; Piccirillo et al 
2009). For instance, neural stem cell renewal is under 
epistatic control of both SOXIO and NR2E1(TLX) (Shi et al 
2008), and our model may suggest that PDGFRA and EGFR 
may act as complementary drivers of these two regulators 
(Discussion). Comparing the CNA-driven network with the 
two compound-target databases Drugbank and Ingenuity, 
we nominate a set of compounds with known activities that 
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could, in principle, counteract endogenous perturbations in 
our model (Figure 3C}. 

Phenotypic and transcriptional consequences of hub 
gene perturbation in glioblastoma cell lines 

To assess the biological relevance of a hub gene in the G 
network that has not been previously associated with 
ghoblastoma, we have chosen to perform directed vahdation 
experiments on NDN. This gene has five downstream targets in 
the G network and shares a common target, fibroblast growth 
factor 9 iFGF9; also known as glia-activating factor), with 
PDGFRA which is frequently amplified in glioblastoma 
(Figure 3B}. NDN is maternally imprinted, located on 
chromosome 15, and encodes a p5 3 -interacting protein that 
belongs to the melanoma antigen family (Taniura et al 1998). 
In the TCGA data, NDN is deleted in 29/186 patients. We 
introduce perturbations of NDN by overexpressing the gene in 
four ghoblastoma cell lines (T98G, U-87MG, U-343MG and 
U-373MG}, leading to decreased cell cycling time in all cell 
lines, except T98G (Figure 4A-C} . Using the U-343MG cell line, 
we measure the expression of a set of downstream targets of 
NDN and PDGFRA by qPCR to assess the transcriptional 
response of NDN overexpression and inhibition/stimulation 
of PDGFRA, respectively. The results confirm a set of EPoC 
predictions, including induction of CPNE8 by NDN, induction 
of KCNH8 by PDGF-AA protein dimers (i.e., a PDGFRA agonist) 
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Figure 3 CNA-driven network of glioblastoma. (A) Overall structure of the resulting glioblastoma network, defined as the set of interactions detected in at least 20% of 
the bootstrap networks (Figure 2B). Red arrows represent stimulatory interactions, blue arrows indicate inhibitory interactions. (B) Close-up of a network region 
containing early neural differentiation markers, including glioblastoma stem cell marker CD133/PROM1, under the control of hub genes CHIC2 and PDGFRA. The 
main hubs of the full network are listed in Table I. Note hub gene A/DA/, further analyzed in Figure 4. (C) Close-up of a network region containing genes that are 
targets of pharmaceutical compounds (as determined by searching the Ingenuity and Drugbank databases). Examples of compounds involved in links include 
simvastatin, SDX-102 (selectively active in /WT/AP-deficient tumors), PD173074 (a FGFR3 inhibitor), and cyclooxygenase 1 (C0X1) inhibitors (PTGS1 encodes the 
C0X1 protein). 



and suppression of KCNH8 by Imatinib (a selective inhibitor 
of PDGFRA and other tyrosine kinases). Further, when 
NDN is overexpressed, FGF9 does not respond to PDGFRA 
perturbation. This is not only consistent with the predic- 
tion that NDN and PDGFRA regulate the transcription 
of FGF9 in opposite directions, but may also suggest a 
more comphcated mechanism that is not captured by our 
model because NDN perturbation by itself did not suppress 
FGF9 levels (Figure 4E}. For the other two of the tested 
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transcripts, GALNT13 and ALK, which are both expressed 
at very low levels in U-343MG cells, we did not detect 
any significant changes. Further, we perturbed the activity 
of the EGFR by activating it using one of its ligands 
[EGF] and inhibiting it with a selective EGFR inhibitor 
[Gefitinib). As readout, we measured the transcriptional 
effect on S0CS2 (a modulator of STAT signaling), NR2E1 
(also known as TLX, a transcription factor beheved to be 
important for neural stem cell renewal), yielding results 
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Table I Hubs in the CNA-driven glioblastoma network model-based on 10 672 genes and 186 patients 



Symbol 


Amp 


Del 


Targets 


Chrom 


Pos 


Description 


EGFR 


146 


2 


134 


7 


55054218 


Epidermal growth factor receptor (erythroblastic leukemia viral (v-erb-b) 
oncogene homolog, avian) 


CDKN2B 


6 


108 


46 


9 


21992905 


Cyclin-dependent kinase inhibitor 2B (pi 5, inhibits CDK4) 


CDKN2A 


6 


108 


27 


9 


21957751 


Cyclin-dependent kinase inhibitor 2 A (melanoma, pi 6, inhibits CDK4) 


MTAP 


6 


101 


22 


9 


21792634 


Methylthioadenosine phosphorylase 


SEC61G 


134 


2 


21 


7 


54787434 


Sec61 -gamma subunit 


PDGFRA 


33 


5 


19 


4 


54790203 


Platelet-derived growth factor receptor, alpha polypeptide 


IFNAl 


6 


101 


14 


9 


21430439 


Interferon-alpha 1 


C0MMD3 


8 


130 


10 


10 


22645304 


COMM domain containing 3 


CHIC2 


35 


5 


9 


4 


54570714 


Cysteine-rich hydrophobic domain 2 


GAD2 


8 


130 


9 


10 


26545599 


Glutamate decarboxylase 2 (pancreatic islets and brain, 65 kDa) 


IFNA14 


6 


98 


8 


9 


21191233 


Interferon-alpha 14 


C10orf97 


9 


126 


6 


10 


15860180 


Chromosome 10 open reading frame 97 


MLLTIO 


9 


129 


5 


10 


21863579 


Myeloid/lymphoid or mixed-lineage leukemia (trithorax homolog, Drosophila) ; 
translocated to 10 


PPYRl 


12 


130 


5 


10 


46503539 


Pancreatic polypeptide receptor 1 


WIPI2 


129 


5 


5 


7 


5220425 


WD repeat domain, phosphoinositide interacting 2 


NDN 


1 


29 


5 


15 


21481654 


Necdin homolog (mouse) 


ADARB2 


9 


123 


4 


10 


1218072 


Adenosine deaminase, RNA-specific, B2 (RED2 homolog rat) 


ELAVL2 


7 


91 


4 


9 


23680104 


ELAV (embryonic lethal, abnormal vision, Drosophila) -like 2 (Hu antigen B) 


GNA12 


128 


5 


4 


7 


2734268 


Guanine nucleotide binding protein (G protein) alpha- 12 


ARMC4 


8 


132 


3 


10 


28141104 


Armadillo-repeat containing 4 


IFNA21 


6 


96 


3 


9 


21155635 


Interferon alpha-21 


LANCL2 


120 


2 


3 


7 


55400634 
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compatible with a coupling between hub perturbation and 
transcriptional response (Figure 4F} . 

Taken together, these observations support that the esti- 
mated CNA-driven network is mechanistically informative, 
and that CNA-mRNA links identified by EPoC broadly agree 
with data from relevant validation experiments. Our approach 
operates at the gene level and our data also support that 
individual hub genes can be identified in practice (e.g., NDN). 
Very large aberrations, however, cannot be fully resolved 
by the current modeling strategy (e.g., EPoC identifies a small 
set of candidate hubs in a 7 Mb region on the short arm 
of chromosome 10, which is often lost in its entirety in 
ghoblastoma; see Discussion, Table I). 

CNA-driven networks contain prognostic 
information 

A crucial test for any disease network model is to ask if it 
produces clinically relevant predictions. While it is well 
established that CNA and mRNA patterns may predict survival 
and response to therapy using a range of supervised or 
unsupervised techniques (Broetef aZ, 2009; Zhang et aU 2009; 
Verhaak et aZ, 2010), less work has been done in deriving 
prognostic scores from actual network models. We thus 
proceed to test the patient prognostic value of the CNA-driven 
network G, using the derived survival scores Zy and 
(above) . 
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As predicted, we find that these scores achieve a signi- 
ficant degree of prognostic separation (Figure 5A). In 
contrast, when we examine the prognostic properties of the 
transcriptional network. A, we find no evident survival 
stratification when separating patients along the leading 
components of the SVD of A. We also demonstrate that 
a standard singular value decomposition (SVD) calculated 
from mRNA profiles or CNA profiles fails to detect survival 
differences in the data. We further calculate survival curves 
for the first six components of both mRNA and CNA data 
in the G, A and data SVD cases, revealing that survival 
differences are only seen in the first SVD component of G 
(Table II). 

To visualize the contribution of individual genes to the 
survival scores, we color-code the CNA-driven network G 
(Figure 5B). As an example, from the leading singular vectors 
of G, we note that CNAs in EGFR and PDGFRA are highly 
amphfied by the network system model (yellow nodes) and 
identify the leading mRNA responders to these perturbations 
(green nodes), which include, e.g., growth factor PDGFA, 
glutamate receptor GRIKl , the transcription factor SOXU , and 
the STAT pathway modulator S0CS2. 

We thus conclude that the estimated CNA-driven EPoC 
model correlates with a clinically relevant phenotype. This 
is in general support of model validity, and suggests that 
integrative models may help to identify clinically useful 
ghoblastoma biomarkers. 
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Figure 4 Experimental perturbations of a network region controlled by NDN and PDGFRA. (A-D) NDN overexpression slows the growth of glioblastoma cell lines. 
(A) Interactions in the network around EGFR, A/DA/ and PDGFRA. (B) Perturbation of A/DA/ by stable overexpression in two separate U343-derived cell lines, denoted 
as A/DA/+ (moderate overexpression) and A/DA/+ + (high overexpression). (C) Growth curves collected during 6 days showed that NDN overexpression 
inhibits growth of U343 cells. Error bars indicate 95% confidence intervals. (D) Single-time point (7 days) measurement of cell number in A/DA/-overexpressing cells. Error 
bars indicate s.e.m. (E) Perturbation of PDGFRA by PDGF-AA protein (ligand) and imatinib (STI-571 ; Gleevec™; inhibits PDGFRA and certain other tyrosine kinases), 
respectively, produces opposite responses in target genes KCNH8 and FGF9, which were identified as downstream targets of PDGFRA in the model. NDN 
overexpression induces CPNE8 target genes and modulates FGF9 response to PDGFRA. Error bars indicate 95% confidence intervals of mRNA expression 
log2-relative to untreated controls. (F) Perturbation of EGFR by its ligand EGF and gefitinib (ZD-1839 Iressa™; inhibits EGFR) produces opposite responses in the 
predicted EGFH target genes S0CS2 and NR2E1. 



Technical comparison with mRNA-based and 
eQTL-type methods 

We compare the performance of EPoC to a set of alternative 
network construction methods. We include a set of mRNA- 
only methods based on information theory (ARACNE), partial 
correlations (GeneNet) and sparse estimation of the inverse 
covariance (precision) matrix (glasso). We also consider 
methods based on combinations of genotype and expression 
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data. These include (i) a univariate SNP-eQTL (Stranger et al, 
2007a, b), here using CNAs in place of SNPs; (ii) a recent 
network method termed remMap (Peng et aZ, 2008); and (iii) 
the SNP-eQTL module network solver LirNet (Lee et al 2009), 
here using CNAs genotypes in place of SNPs. remMap and 
LirNet, similar to EPoC, use variants of lasso for model fitting 
and are thus preferred points of comparison. remMap was 
recently proposed to relate genomic-region variations to select 
genes in breast cancer, and LirNet was of late advantageously 
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Glioblastoma prognosis: 
projection onto the CNA-driven network 

Projection of CNA profiles Projection of mRNA profiles 

1 



B 



Projection onto 
the CNA-driven 
network (G) 




Gene-specific 
projection scores 





Days after surgery 

Patients are stratified by: 
projection > 0 (red), and projection <0 (blue) 

Figure 5 Derivation of prognostic scores from the network model. (A) Kaplan-IVIeier curves to assess prognostic scores extracted from tlie CNA-driven network. 
Prognostic scores are computed by a sparse singular value decomposition of the CNA-driven network G (Materials and methods). Patients are divided into two groups 
by projecting their CNA profiles and mRNA profiles onto the main left and right axis of the singular vectors of G, respectively. This separates patients with favorable and 
poor prognosis (upper panels). By contrast, the corresponding analysis of the transcriptional network A (middle panels) does not produce any significant separation 
of patients in terms of survival, nor does a standard singular value decomposition (SVD) of the mRNA profiles or CNA profiles (lower panels). The panels show the 
results obtained by projection onto the first SVD components. The results obtained when projecting onto additional components are given in Table II. (B) The sparse 
singular value decomposition of the CNA-driven network G identifies genes with strong scores for signal amplification, i.e., genes whose perturbations are highly 
amplified by the network system (here illustrated as yellow nodes, e.g., PDGFRA), as well as mRNA transcripts that are most affected by these perturbations 
(green nodes, e.g., GRIK1; Figure 1C). 



compared with a set of eQTL-type methods, including 
Geronemo (Lee et aZ, 2006} and Bayesian networks with eQTL 
priors (Zhu et aU 2008). Details of the comparison are given in 
Materials and methods. 



Model consistency between independent glioblastoma 
data sets 

We identify the subset of 146 patients (out of the 186 patients 
analyzed above) , for which two independent CNA and mRNA 
data sets have been produced at different institutes in the 
TCGA consortium. These technically independent data sets 
provide an ideal setting for an unbiased comparison of the 
methods. We thus apply each method to the two data sets, and 
use Kendall's Wto investigate the consistency between the two 
solutions (Materials and methods). This analysis shows 
stronger performance by EPoC CNA-driven networks G over 
all other methods for all but the largest network sizes 
(Figure 6A), i.e., EPoC G network solutions from two 
technically independent data sets largely agree both in terms 
of detection and estimated strength of network interactions. 

Apart from glasso, methods that incorporate combined 
genomic/transcriptional data perform better than mRNA- 
based networks (ARACNE and GeneNet). We also derive 
transcriptional EPoC A networks (solving Equation (2); 
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Materials and methods). EPoC A corrects transcripts for their 
own CNAs prior to network construction and performs quite 
well, but clearly worse than EPoC G. This is best explained by 
the stronger correlations among mRNAs compared with CNAs 
(predictor variables in EPoC A and EPoC G, respectively), as it 
is well known that regression modeling with highly correlated 
predictors is subject to instability (Breiman, 1996; Skogestad 
and Postlethwaite, 2005; Nordling and Jacobsen, 2009; and 
references therein). While CNAs may exhibit strong correla- 
tion within a genomic region, CNA correlation between 
genomic regions is globally much lower than between mRNAs 
(data not shown). As expected, EPoC G, remMap and LirNet 
perform better than standard eQTL, which likely reflects the 
benefit of a multivariate modeling approach, using regulariza- 
tion (LI in EPoC G, and LI +L2 in remMap and LirNet) over a 
univariate approach (eQTL) . 

Pathway overlap and prediction error 

We proceed to determine the overlap between the derived 
networks and two protein-protein interaction (PPI) and two 
pathway databases. Results from this analysis demonstrate a 
similar ranking of methods as in the robustness tests above 
(Figure 6B, Materials and methods). That is, EPoC G captures 
the most known direct and short-path interactions. In absolute 
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Table II Survival differences 



Network and data type Comp 1 Comp 2 Comp 3 Comp 4 Comp 5 Comp 6 

(A) log-rank test P-values of survival difference between patient with a positive versus negative loading on SVD components 1-6 

EPoCG, CNA *0.0014 0.1129 0.0216 0.1157 0.0853 0.0147 

EPoCG, RNA *0.0004 0.1560 0.0759 0.0818 0.0516 0.0412 

EPoCA, CNA 0.3109 0.1468 0.1100 0.0393 0.2817 0.1069 

EPoCA, RNA 0.2198 0.3526 0.0479 0.2402 0.1570 0.3621 

SVD, CNA 0.0505 0.1266 0.0261 0.0042 0.4677 0.4251 

SVD, RNA 0.0869 0.0963 0.2822 0.1198 0.0225 0.4091 

Subtype Classical Proneural Neural Mesenchymal 

(B) log-rank test P-values of survival difference between patients in different molecular subclasses, as defined in {Verhaak et al, 2010) 
Classical — 

Proneural 0.4517 — 

Neural 0.6744 0.5598 — 

Mesenchymal 0.3600 0.7371 0.5220 — 



Network consistency between two g Enrichment of known PPI and pathway 

independent glioblastoma data sets interactions (HPRD, intAct, NCI, reactome) Key 




Network size Network size 



Figure 6 Method comparisons: network consistency and pathway interactions. (A) We compare network models derived from two full replicate glioblastoma 
data sets (146 identical tumors; same patients and samples) but processed at different centers with slightly different technological setups (Affymetrix and Agilent 
technologies, run at MSKCC, Harvard Medical School and Broad Institute, Materials and methods). This test measures each method's reliability, i.e., its robustness 
to noise and technological factors. EPoC estimation of the CNA-driven network G is the best-performing method on the TCGA data (1 -1/1/ lower, arrow /•). Glasso is 
second best, followed by sparse estimation of the transcriptional network A (EPoC /A), and remMap. LirNet, eQTL, GeneNet and ARACNE all exhibit less 
robust performance compared with EPoC G. (B) We map interactions found by EPoC and other methods to molecular links in the pathway repositories 
HPRD, Reactome, Intact and NCI-nature. Each interaction is characterized by the number of steps minimally needed to 'walk' between the network gene and its 
target (i.e., the shortest path). We argue that a well-estimated network should be comprised of identified interactions that either match known interactions in the 
databases or are enriched for shorter paths. The figure depicts the enrichment (relative proportion of interactions that correspond to a shortest path length of 1 or 2 
interactions in a pooled network based on the four different pathway databases). EPoC G interactions are clearly enriched for short or direct paths in the databases, 
followed by glasso and EPoC A. 



numbers, 65 interactions in the CNA-driven network corre- 
spond to known short-range interactions (shortest path 
between the genes is one or two steps in at least one of the 
databases) . Among the other methods, EPoC A and glasso also 
exhibit a good overlap with known pathways. 

We emphasize robustness in the above analysis, but 
EPoC can also be used to generate network models for 
mRNA prediction from CNA data (Materials and methods). 
We assess the prediction accuracy by computing cross- 
validation errors for EPoC G, remMap and LirNet, which 
all produce natural prediction models. We conclude that 
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EPoC G results in the smallest prediction errors on the 
ghoblastoma data set (Supplementary Figure 2) . 

Qualitative differences and speed 

Additional comparisons (Supplementary information) show 
that the genotype-driven networks and mRNA-based networks 
capture complementary genomic process information in terms 
of structure and gene content: the EPoC CNA-driven network 
has a more hub-oriented structure, with several development- 
associated genes as responders, whereas the transcriptional 

© 2011 EMBO and Macmillan Publishers Limited 



Network modeling of glioblastoma 
R Jornsten et al 



network captures inflammatory and cell cycle processes 
(Supplementary Figure 1). The distinction between CNA- 
driven and transcriptional networks is well illustrated by a 
hierarchical clustering summary, structurally comparing 
networks produced by all the above methods using Kendall's 
W. We see a clear structural separation between mRNA-based 
and genome-type-driven networks (Supplementary Figure 1). 
As a practical consideration, we also demonstrate that the 
scalability and algorithmic speed of EPoC is highly competitive 
(Supplementary information) . 

From these comparisons, we conclude that EPoC exhibits 
excellent performance in terms of model reproducibility, 
validity and algorithmic speed (see Materials and methods 
and Supplementary information). EPoC is primarily of interest 
for derivation of large networks at the single gene level and thus 
complements methods that use different levels of description, 
such as grouping CNAs into whole regions, integrating CNA and 
mRNA events over PPI networks, or by condensing transcripts 
into modules or linear superpositions (below). In Materials 
and methods and Supplementary information, we discuss the 
relative performance of the methods in more detail, including 
plausible explanations for performance differences. 



Discussion 

We have demonstrated that combined network modeling of 
CNAs and mRNA expression levels in tumors generates 
mechanistically and prognostically informative results. The 
network-based survival scores introduced here serve to 
identify molecular features useful for predicting the outcome 
of individual patients, adding to our understanding of survival 
differences in the TCGA cohort (Verhaak et aU 2010) . Extensive 
computational and experimental assessments confirm EPoC as 
an efficient and robust methodology to interpret CNA-mRNA 
profiles of ghoblastoma. Applying our method to other tumors 
is facilitated by an efficient solver in R and MATLAB packages 
(Materials and methods) . 

Possible limitations 

CNAs often span multiple genes in large chromosomal regions 
(sometimes a whole chromosome or chromosome arm), 
introducing copy number correlations between genes in the 
affected region. This may lead to erroneously identified CNA- 
mRNA couplings in network construction. EPoC tries to 
address this in two ways: first, each transcript's mRNA signal 
is corrected by its own CNA value (Materials and methods), 
which largely de-correlates the transcript's signal with 
neighboring CNAs, thus reducing the risk of including false 
couplings in such regions; second, the bootstrap procedure 
will work against CNA-mRNA associations that cannot be 
resolved at the single gene level. Empirically, this seems 
to work well in ghoblastoma, as known oncogenes and tumor 
suppressors are recovered as single gene hubs by the 
algorithm. However, we also note cases where EPoC does 
not resolve a single regulator (e.g.. Chromosome 10, see 
Results section). While our tests show good support for 
pursuing a gene-level approach for ghoblastoma, optimally 
modeling larger genomic regions is an interesting future 
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research problem; possible approaches to explore are cluster- 
ing of CNA profiles into regions (Peng et aU 2008), adapting 
statistical techniques from linkage analysis (Jiang and Zeng, 
1995) or using annotation features as regulatory priors (Lee 
et al 2009). 

Our analysis focuses on the full set of ghoblastoma patients 
in the TCGA compendium. We also considered subtype- 
specific models, using the recent classification of the TCGA 
ghoblastoma cohort into Classical, Mesenchymal, Proneural 
and Neural subtypes (Verhaak et aU 2010), but found that 
the number of patients within each subtype is too smah to 
produce a robust network model (as indicated by the boot- 
strap procedure, data not shown) . We reserve work on specific 
cancer subtypes for when more patients become available, and 
suggest a rule of thumb that EPoC be apphed to 100 patients or 
more. Future data sets may also involve finer anatomical 
sampling, helping to model distinct sub-populations of cehs in 
ghoblastoma tumors. 

Other approaches 

In this article we have explored and compared several methods 
for the analysis of CNA and mRNA data. We briefly discuss 
some additional methods here. 

CCA is a traditional multivariate technique, and has recently 
been extended and applied to integrate CNA and mRNA data 
(Waaijenborg et aU 2008; Witten et al 2009). CCA links 
modules of CNA to modules of mRNA, i.e., identifies a 
sequence of linear combinations of subsets of CNAs maximally 
correlated with linear combinations of subsets of mRNA. 
Thus, CCA does not provide a gene-level network model, but 
opts to summarize CNA-mRNA interactions at a module level. 
CCA treats CNA and mRNA data symmetrically. Therefore, 
CCA components (module-module links) do not generally 
agree with the survival score decomposition of the CNA-driven 
network G, except under very unrealistic assumptions 
(Materials and methods and Supplementary information). 
An investigation to further compare module-level and gene- 
level methods, and alternative decomposition techniques is 
reserved for future work. 

A second alternative approach to integrate CNA and mRNA 
data would be to adapt the electrical circuit-inspired model, 
eQED (Suthram et al, 2008; Kim et al, 2010), which links 
genetic perturbations to transcriptional responses over a 
predefined network from molecular interaction databases. 
Clearly, the key distinction between this approach and EPoC 
is the use of PPI and other data as a scaffold on which to 
construct the model. In work by Zhu et al (2004, 2007, 2008), 
it is suggested to use SNP-eQTL priors to guide the construc- 
tion of mRNA-based transcriptional networks (for a perfor- 
mance test against a lasso-based method, see Lee et aU 2009). 

Future directions 

For biological fohow-up work, there are several predicted 
couplings between CNA hubs and stem cell markers that may 
be interesting to investigate further (Figure 3B). For instance, 
autocrine signaling loops are known to suppress differen- 
tiation in ghoblastoma (Erlandsson et aU 2006; Dai et aU 2001) ; 
it is therefore interesting that our model connects the PDGFRA 
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and EGFR hubs to at least seven known growth factors 
(including PDGFA, PDGFD, FGF9) and five known feedback 
regulators of cell signaling (including SPRYl, SPRY2 and 
S0CS2). Exploring such links by targeted experiments may 
lead to new insight regarding signaling networks in glioblas- 
toma. In addition, the compound predictions (Figure 3C} can 
be explored in several ways, for instance, by assessing the level 
of synergism between these compounds and standard gho- 
blastoma therapies such as Temozolomide. 

Future methodological work should be aimed at incorporat- 
ing other data types, including miRNA expression, DNA 
methylation and sequence mutations, to model the nonspecific 
perturbation term, r in Equation (3), currently treated as noise 
in the network model estimation procedure. Both modeling 
and model testing might benefit from additional molecular 
network reference data. For instance, TF-target links could also 
be included as a prior for network construction or to guide 
method development. Extending biological-mechanistic 
models to encompass all levels of the regulatory process will 
require careful consideration and important choices of model 
representations. Of particular interest will also be to further 
develop prognostic scoring methods by linking network 
decomposition techniques to more advanced survival model- 
ing, which may be an interesting extension to methods that 
relate principal components in tumor molecular profiles to 
relative hazard (Nguyen and Rocke, 2002a, b). 

Ambitious efforts are currently being undertaken to acquire 
comprehensive genome-scale data sets for several cancer types, 
including the Cancer Genome Atlas, and the International 
Cancer Genome Consortium. The recent data deluge presents 
a challenging opportunity to develop mechanistically and 
clinically relevant models of the data. EPoC is one step in this 
direction, and helps to set the stage for the continued modeling 
efforts in the context of human cancer genome programs. 

Materials and methods 
Glioblastoma data preparation 

We obtained DNA and mRNA molecular profiles from the Cancer 
Genome Atlas (TCGA) data repository (http://tcga.org and TCGA- 
Consortium, 2008). We use level 3 data (discrete copy number 
estimates and mRNA levels for known protein-coding genes) 
generated using Agilent 244 k DNA and Agilent 44 k mRNA, and 
Affymetrix U133 mRNA arrays. For model construction, mRNA and 
Affymetrix mRNA signals are averaged for more stable signal (Verhaak 
et al, 2010). Sex chromosomes are removed from the analysis. 

Prior to analysis with EPoC or other network estimation methods, 
we standardize the amplitude of the mRNA levels, i.e., center each 
gene around its mean expression level and divide by its standard 
deviation. All methods have also been compared on unstandardized data, 
but standardization substantially improves the consistency of models 
between replicate data sets, as well as facilitates the search for an optimal 
regularization parameter in the sparse regression modeling, as, after 
standardization, a single parameter value can be used for all transcripts. 

Network parameter estimation 

After centering and standardization, and assuming steady state. 
Equation (1) is rewritten as 

Aui + ^ (Wij - Vij)Ayj + (log aj - log ocQ - (log - log p^) = 0, (4) 



(where and refer to mean gene-specific constants across all 
tumors) . Collected for all transcripts i, the above translates to the linear 
equation systems in Equation (2) AAY-\- AU + R^O. Equation (2) can 
be transformed into Equation (3), where we have AY^GAU-\-T, 
G=-A~^ and r^-A~^R. AY is the nxT matrix of mRNA expression 
levels for the n transcripts across the T tumors, similarly for the CNAs 
AU, and G is the nxn CNA-driven network matrix. Below, we outline 
how to solve for parameters of interest in Equation (3) , although this is 
easily recast to estimate parameters in Equation (2) instead. 

Prior to solving for parameters of interests, EPoC uses, by default, an 
optional filter to distinguish candidate (CNAs) from inherited (germ 
line) copy number variations (CNVs) . For each gene, we calculate the 
number of patients with amplification of the gene (M), and the 
number of patients with deletion of the gene (N2). Given a total 
number of changes of Ni + N2, we evaluate P as the binomial 
cumulative probability function for Ni successes in Ni + N2 attempts 
at 0.5 probability. Candidate hubs are selected as genes, for which P < 6 
(bias toward deletion) or P> 1-5 (bias toward amplification). For our 
analysis, we set 6=10~^, thus including 25% of the genes as possible 
regulators. The key difference when the filter is not applied is that the 
CNV gene GSTTl is selected as a hub; this gene has both gains and 
losses, indicating no selection by the tumor, and was also seen in 
ovarian cancer data from TCGA (data not shown), and is located 
within a known CNV. We expect that the recurrent CNA detection 
programs RAE or GISTIC could also be used in this step with good 
results (Beroukhim et al, 2007; Taylor et al, 2008). 

For each gene i we first estimate the direct effect of the gene's own 
CNA by the positive truncated least-squares estimate: d=max(0,A(7f 
AYi), where AYt is the T x 1 vector transpose of the i-th row of AY, and 
similarly for A[/j. From the CNA filter operation above, we obtain the 
set of H candidate hubs and denote the corresponding CNA values by 
AUh, where AUnis a.H x T matrix. We then solve the nil regularized 
regression problems (one for each gene) , treating r terms as noise in 
the estimation: 

min ||(AY, - dAU,) - AU^^fi.Wl + ^ E l^'WI' 

where AUhv=A[/h[\^VX i.e., the AU^ matrix excluding gene i, and Gj 
denotes the Hxl vector transpose of the i-ih row in G with elements 
corresponding to the hub set H, but excluding gene i, and X is the 
regularization parameter that controls the degree of sparsity (number 
of non-zeroes) in G. Gjlj] denotes the j-th element in vector Gj. We solve 
Equation (5) using the cyclic coordinate descent algorithm (Fu, 1998; 
Friedman et al, 2007). Following Friedman et al (2007), we speed up 
the computation using the fact that X>AUH\i\\AYi-dAUi\\ao implies 
that Gi=0 (Osborne et al, 2000), meaning that we need not search for a 
model for transcripts that meet this criterion as the optimal model will 
be empty. As a global upper limit for X, we define A.niax=niaXjA[/H\i 
||Ay~dA[/,||oo, for which all elements of G will be zero. The EPoC 
algorithm is summarized below (Box 1). Note, the same algorithm can 
be used to estimate the transcriptional network A by simply replacing 
AUH\i with the mRNA data AY\i. 



Optimizing the size of the network 

The size of the estimated network is controlled by the lasso penalty 
parameter X. To determine an appropriate value for X between 0 (fully 
connected model) and X^^^ (smallest, non-empty model) we consider 
two different validation criteria. 

We first compare network consistency using Kendall's W (Kendall 
and Smith, 1939), commonly used to assess agreement among rank- 
order lists and lately applied for network inference (Vacher et al, 2008; 
Milns et al, 2010). Here, we rank the network edges (presence and 
magnitude of an interaction) from different network estimates. If the 
rank lists agree completely, W is 1 , and if the rank orders exhibit a 
random overlap. Wis 0. Kendall's Wis analogous to rank correlation 
but, as it can compare several rank lists instead of only two, we chose 
to use this measure instead of Spearman's rank correlation for future 
scalability. Here, we randomly split the data set into two non- 
overlapping groups, independently estimating a network for each 
group, and there are thus two rank lists to compare. For more robust 
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Box 1 EPoC flowchart 

1. Data preparation 

Center the mRNA and CNA data, constructing the two n x 7 data 
matrices AY and Ail. 

Reduce the CNAs to candidate hubs using the germline binomial 
filter described above. Collect the candidate CNA hubs in the 
H xT data matrix AUh- 

2. Estimate tlie direct effect of eaclri transcript's own CNA 
For each gene /, 

estimate the direct effect of the gene's own CNA by 
cy=max(0,AL//^Ay/), where AY, is the 7 x 1 vector transpose 
of the /-th row of AY, and similarly for AL/,- . 

3. Estimate elements of G using lasso 
For genes /=1, ... ,n, solve the lasso problem 

min \\{AYi - dAUi) - AUlyGi\f, + XY1 l^'WI' 

jeH\i 

where G,- denotes the vector transpose of /-th row in the gain matrix 
G with elements corresponding to the hub set H, but excluding 
gene /. 

4. Optimal network size 

Randomly split the data into two groups. Apply steps 2 and 3 for 
different values of X to each group, and use Kendall's W to 
compute network agreement. Select the optimal X that maximizes 
W (average W over 1 000 random splits) and record the 
corresponding network size S. (Alternatively, here optimize for 
mRNA prediction.) 

5. Final network 

Generate 8=1000 pseudo-bootstrap data sets. For selected X in 
step 4— find the corresponding X' that generates networks of size S 
on the bootstrap data set (applying steps 2 and 3 above). The 
bootstrap networks are denoted G^, b=^, ... ,B. Collect 
interaction frequencies \{ij}=J2b=^^G'^{iJ]f^)/B. The final 
network consists of interactions with f{ij}>f (here with f =0.2). 
EPoC has been implemented in both R and MATLAB (with C) and 
the software is available for download at http://sysbio.med.gu.se/ 
epoc.html 



inference, the above validation procedure is repeated 1000 times. The 
final choice of X, or network size, is based on the mean value of 
Kendall's W across the 1000 random splits (Figure 2 A, upper panel). 

We also assess the prediction power of our method. When mRNA 
prediction is the goal, we need to optimize network size with this in 
mind. We thus compute the prediction errors for each mRNA transcript 
using cross-validation. Leaving out one 10th fraction of the data, we 
estimate the gain matrix: G^^\ For the left:<)ut pjortion of the data, we 
can now predict the mRNA transcript by AY'''^^=G^'^^AU^^^ and compare 
with the true observed mRNA expression levels AY^^. Note, AY'-'^^ 
^[/(^) j-gfgj. ^0 the left-out data, whereas the estimate G^^^ refers to the 
estimate based on all data, except the left-out portion. The cross- 
validation prediction error is defined as 

igllAyW-gWAuWll^ 



The validation procedure is repeated 1000 times. The final choice of X 
is based on the mean prediction error across 10-fold random splits of 
the data (Figure 2 A, lower panel) . We note that networks optimized for 
prediction are larger than networks optimized for robustness. 

For robust inference, we produce a final network model by repeating 
the estimation and validation procedures several times using so-called 
pseudo-bootstrap (Friedman et al, 2000). We generate bootstrap 
samples as follows: for each bootstrap simulation 5=1 , . . . ,B, we create 



pseudo-bootstrap mRNA data as Ay^=(l-c)Ay + cAy* (here B=1000 
bootstrap simulations). AY* is a ^ x Tdata matrix, where each column 
(tumor sample vector across genes) has been randomly sampled from 
the columns in AY. That is, in AY^ each column represents a weighted 
combination of one tumor sample vector with another sample. 
The constant c is small, here set to 0.01. The pseudo-boostrap 
CNA data, AU'^, is obtained in exactly the same way. We then estimate 
the boostrap network by applying EPoC to the data set [AU^, AY^). 
We collect frequency information for each interaction as f{i,j} = 
J^fa=il(G^{iJ} 7^0)/B, i.e., f{i,j} is the proportion of bootstrap samples 
for which the interaction i <- j is present in the network. Large values of 
f{i,j} suggest that the interaction is real, whereas small values suggest 
it is detected just by chance (numerical instabilities in the estimation 
procedure) . We compute a cutoff for f using permutations of the data 
(Figure 2B) and pick a frequency threshold of 20%, which is well 
above interaction frequencies expected by chance. We visualize the 
obtained networks using Cytoscape (Shannon et al, 2003). 



Network-based survival score 

The singular value decomposition of G is G=CAD^ (where CC^^I, 
DD^^I and A is diagonal) . The right singular vectors (columns) in D 
represent the leading directions of CNA perturbations that are 
amplified by the system G. The left singular vectors (columns) in C 
represent for which mRNA transcripts these perturbation signals result 
in the largest effects. To aid in interpretation and identify a small set of 
potential prognostic biomarkers, we here construct the SVD of G using 
a sparse PCA method (Zou et al, 2006). This is based on a regression 
formulation of PCA, and employs a combined LI and L2 penalty 
(a.k.a elastic net] to identify the non-zero loadings of the principal 
components. The sparse SVD component D is obtained through a 
sparse PCA of GG\ and the sparse component C is obtained through a 
sparse PCA of G^G. The level of sparsity is chosen such that (i) we 
obtain a reasonable set of biomarkers for possible therapeutic follow- 
up, and (ii) the solution is stable across neighboring values of the 
sparsity penalty. 

The mRNA profiles of individual patients are projected onto the 
singular vector space by Zy^C'^AY (rows of Zy will be components, 
columns will be patients); and CNA profiles are projected by Zu=D^A[/ 
(rows of will be components, columns will be patients). For the 
different components of Zy and we thus compare patients z > 0 and 
z<0 in terms of clinical survival (from date of surgery to date of 
death) ; survival difference P- values are obtained by the log-rank test. 

One could consider constructing alternative biomarker modules to 
the above using sparse canonical correlation (CCA; Waaijenborg et al, 
2008; Witten et al, 2009). In the Supplementary information, we 
discuss this alternative in more detail, but note here that sparse SVD of 
G can disagree substantially from CCA. The SVD of G focuses on CNA 
as the input or driver of mRNA changes, whereas CCA treats CNA and 
mRNA symmetrically. In our toy example, SVD of G correctly identifies 
CNA biomarkers and mRNA responders (Figure IC). In contrast, CCA 
is susceptive to, and reflective of, the structure of the noise term of 
Equation (3). This is a major concern in our ghoblastoma data set, 
where we know that the noise structure is non-negligible, capturing all 
the mRNA-mRNA dependencies that are non-CNA related. 



Method comparisons 

A detailed description of the methods is found in the Supplementary 
information. 



Structural consistency tests 

We first construct two replicate versions of the TCGA data set, A and B. 
A comprised array-CGH and Agilent array measurements from 
MSKCC; B comprised Agilent array-CGH profiles and Affymetrix 
U133A mRNA profiles generated at Harvard and Broad Institute, with 
both A and B consisting of 146 individually matched samples. For 100 
iterations, we select a mixture of the 250 genes with the highest mRNA 
variance in one of the data sets, plus an additional random 250 genes 
from the 10 672 genes studied. This way, we get a set of genes that can 
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be analyzed also by the slowest methods (remMap, glasso, ARACNE), 
and which introduces a bias in favor of the methods that uses mRNA 
data only. We subsequently run each method for each of a series of 
parameter values corresponding to stringency, resulting in a series of 
networks of different sizes. The parameters tuned are glasso p, 
ARACNE dpi, GeneNet significance threshold, EPoC X, remMap LI 
penalty, LirNet LI penalty and eQTL P-value cutoff. remMap has an 
additional L2 parameter, which is tuned for optimal performance (W). 
Similarly, LirNet is optimized to perform well by (i) using the same set 
of initial clusters/modules in the A and B data; (ii) by optimizing the 
number of modules and the L2 penalty. For all methods, we compute 
network agreement between data sets A and B using Kendall's W 
(Figure 6A). 

Pathway comparisons 

We download Reactome, IntAct and HPRD from Pathwaycommon- 
s.org, and map identifiers to the 10 672 genes in our data set. We 
subsequently calculate the undirected shortest path Rij for all gene 
pairs in these databases using Johnson's algorithm (Johnson, 
1977). For a given network G, we then calculate the enrichment 
(relative proportion) as: 

P{Rij ^ k\i and j are connected in G) 
P{Rij = k\i and j are connected in a permutation of Gpermuted) ' 

calculated across non-diagonal elements [i^j] and where Gpermuted is 
generated by random permutation of the non-diagonal G elements 
(1000 simulations; Figure 6B). For Figure 6B, /c=2 is used. 



Comparison of growth rates 

For U-343MG, cell growth of NDN-expressing and cells transfected 
with insert-free control plasmid was measured using the MTT 
colorimetric assay. Initially, 1500 cells were seeded in 96-well plates, 
with 11 replicates for each time point and treatment. At the respective 
time point, cells were incubated with 20 |il of 5 mg/mP^ MTT (3-(4,5- 
dimethylthiazol-2-yl)-2,5-diphenyltetrazolium bromide) dissolved in 
PBS. After 6h at 37°C, media were removed and formazan crystals 
were dissolved in DMSO, and absorbance was measured at 570 nm. For 
U-87MG, 75 000 cells/well were seeded in six-well plates and counted 
after 24 days. For U-373MG and T98G, 200 000 cells/well were seeded 
in six- well plates and counted after 7 days. Cell number was measured 
using a Chemometec cell counter. 



Statistical model to test growth rate differences 

We assume an exponential model for the growth of glioblastoma cells 
in culture, h{t)^ho2'^\ in which where t=time (days), h oc number of 
cells and ho is the value of h at t=0. Following log transformation, this 
is expressed by the linear equation Iog2{h{t]]^log2{ho] + kt, which 
is in good agreement with experimental data (not shown). The 
constant k is estimated by linear least squares, with confidence 
interval k±Sk- ^0.975,^-2. where 5^ is the empirical standard deviation 
of k. The time tj^, it takes to double the number of cells, is given by the 
inverse of k, i.e., t^^l/k, which has confidence interval 

= ■24x--— ^ hours (6) 

k^ Sk • fo.975,n-2 



Prediction of mRNA levels 

We calculate the 10-fold cross-validation error for the remMap, LirNet 
and EPoC methods on random sets of 500 genes (above) (due to speed 
constraint of remMap). Each method is tuned to perform well by 
adjusting LI and (when applicable) L2, and module number. 



Experimental methods 

Cell culture and perturbation of NDN, PDGFRA and 
EGFR 

Human U-343MG glioblastoma cells were cultured in Dulbeccos 
modified Eagles medium (DMEM; Lonza) supplemented with heat- 
inactivated 10% fetal bovine serum, 100 units/ml penicillin and 
O.lmg/ml streptomycin. Cells were kept in a 37°C humidified 
incubator containing 5% CO2. For transfection, 1.5 x 10^ cells were 
seeded in a six-well plate, and the FuGENE 6 transfection Reagent 
(Roche) was used. The cells were transfected by the FLAGC-NecF 
plasmid containing Flag-tagged NDN and the neomycin selection 
cassette. For generating stable transfectants, 500|ig/ml neomycin 
was used, whereas for continuous growth 250|ig/ml was added. 
Two different cell lines [NDN+ and NDN++] were generated 
that expressed NDN at different levels. Negative control lines were 
generated by the same protocol, but with an insert-free plasmid. In 
addition to this experiment, three other cell lines. Human U-87MG, 
T98G and U-373MG (Cell Lines Service, Germany), were transfected 
with the same plasmid. U-343MG was re-transfected and previous 
results were confirmed. Conditions and procedures were similar, 
except for the transfection procedure, where cells were seeded in six 
replicates in six-well plates (0.7 to 4.3 x 10^ cells/well) and transfected 
with OPTI-MEM, lipofectamine and PLUSreagent (Invitrogen Corp.). 
To perturb the PDGFRA node, PDGF-AA was added at 30ng/ml. To 
inhibit cytoplasmic tyrosine kinases, we used STI-571 (Gleevec) at 
1 |iM. Cells of 50-70 % of confluence were grown in a six-well plate and 
in serum-free DMEM medium for 1 day before PDGF-AA and STI-571 
treatments. The duration of PDGF-AA and STI-571 treatment was 14 h. 
To perturb the EGFR node in U-87MG, U-343MG, U-373MG, T98G and 
A172, EGFR inhibitor Gefitinib (Selleck Chemicals) and EGFR ligand 
EGF (supplied by Peprotech) was added at 5|iM and 50ng/ml, 
respectively. For the Gefitinib-treated cells, controls were treated with 
equal concentration of DMSO. Cells were seeded in triplicates at 5 x 10^ 
cells/ well in a six- well plate 24 h before addition of Gefitinib and EGF. 
Cells were treated for 6h at 37°C. 



In our experiment, we tested the hypothesis that cells that over- 
express Necdin (NDN) grow slower than cells that have been 
transfected with a negative control plasmid (CTRL). Thus, our null 
hypothesis Hq is that /cndn^^ctrl and our alternative hypothesis Hi is 
that /cNDN<fccTRL- To compare the doubling rates /cndn and /cctrl> 
we calculate a T statistic: 

rj. fccTRL - ^NDN , /„x 
I = I ^ t2n-4 [0 

V^CTRL + '^NDN 



Detection of Necdin expression 

For U-343MG, cells of 50-70% confluence were solubilized with lysis 
buffer (150 mM NaCl, 5mM Tris, pH 8, 0.5% deoxycholate, 10% 
glycerol, 1% NP-40, 1 mM PMSF, 1 mM DTT, 1 mM aprotonin). 
Samples were boiled for lOmin and protein content was measured 
using Bio-Rad protein-assay Dye Reagent. Samples of 10|ig total 
protein were analyzed on a 10% SDS-PAGE. After electrophoresis, 
the proteins were transferred onto a nitrocellulose membrane 
(Amersham) with a semi-dry transfer equipment (Bio-Rad). The 
membrane was processed for immunoblotting using an anti-FLAG 
primary antibody (1:2000, Sigma- Aldrich) at 4°C over night and a 
secondary anti-mouse antibody (1:10 000, GE Healthcare) for 1 h. To 
detect signal, the membrane was developed using the ECL Advance 
System (GE Healthcare) according to the manufacturer's protocol and 
scanned using LAS- 1000 Plus (Fujilm). 



qPCR analysis 

Cells cultured in a six-well plate were harvested and RNA isolated 
using the TRIzol reagent (Invitrogen) and Ethanol precipitation. We 
synthesized cDNA by annealing oligo-dT primers and elongating with 
M-MuLV reverse transcriptase (Fermentas) according to the manufac- 
turer's protocol. To avoid DNA contamination, the TURBO DNA-free 
kit (Ambion, Applied Biosystems) was used. We measured levels 
of transcripts and estimated Q values for GAPDH, NDN, CPNE8, FGF9 
and KCHN8 (primer sequences in Supplementary information) using 
the StepOne Real-Time PGR system (Applied Biosystems). Primer 
sequences were selected using the software Primer Express 3.0 
(Applied Biosystems). For qPCR analysis of S0CS2 and NR2E1, cells 
cultured in a six- well plate were harvested and cellular total RNA was 
extracted using the RNeasy Plus Mini Kit (Qiagen), according to the 
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manufacturer's protocol. cDNAs were synthesized from total RNA 
(1 |ig) using random primers according to the protocol (High Capacity 
cDNA Reverse Transcription kit. Applied Biosystems). The expression 
levels of human S0CS2 and NR2E1 mRNA was evaluated using 
quantitative real-time PGR (TaqMan Gene Expression Assays, Applied 
Biosystems). Each reaction was run according to manufacturer's 
protocol (Applied Biosystems). TaqMan Gene Expression Assays used 
were Hs99999903ml for human ACTB; Hs00919620ml for human 
S0CS2; and Hs01128417ml for human NR2E1/TLX. The reaction was 
run using an ABI PRISM 7900 HT Sequence Detection System (Applied 
Biosystems). Data were collected and recorded by ABI PRISM 7900 
SDS Software (Applied Biosystems). Samples were run in duplicates. 



Statistical analysis of qPCR experiments 

Quantity levels (on an arbitrary scale, using four-point standard 
curves) were estimated using Applied Biosystems software. Statistical 
testing was conducted to test the following null hypothesis (using a 
four-sample, unequal variance t-test) : log (quant, of test gene in treated 
cells) -log (quant, of Ctrl gene in treated cells) =log (quant, of test gene 
in normal cells) -log (quant, of Ctrl gene in normal cells) . The validity of 
using a t-test was confirmed by assessing the distribution of residuals 
(difference between log (quantity) and the mean of log (quantity)). 
Residuals were compared with a normal distribution by the 
Kolmogorov-Smirnov test, showing no evidence (P-value=0.1793). 
We used both ACTB and GAPDH as housekeeping controls. 

Supplementary information 

Supplementary information is available at the Molecular Systems 
Biology website (www.nature.com/msb). 
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